{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. UIE Introduction\n",
    "\n",
    "[UIE(Universal Information Extraction)](https://arxiv.org/pdf/2203.12277.pdf)：Yaojie Lu et al. proposed the universal information extraction unified framework UIE in ACL-2022. The framework achieves the unified modeling of entity extraction, relationship extraction, event extraction, sentiment analysis and other tasks, and makes different tasks have good migration and generalization capabilities. In order to facilitate the use of UIE's powerful capabilities, PaddleNLP released the first Chinese general information extraction model, UIE, based on the ERNIE 3.0 knowledge enhancement pretrained model. The model supports key information extraction without limitation of the industry field and extraction domain, achieving excellent few-shot finetune ability to adapt to specific extraction domain.\n",
    "<div align=\"center\">\n",
    "    <img src=https://user-images.githubusercontent.com/40840292/167236006-66ed845d-21b8-4647-908b-e1c6e7613eb1.png  width=\"800\"/>\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Application & Model Effects\n",
    "\n",
    "\n",
    "- Medical Scenario - Specialized Disease Structured\n",
    "\n",
    "<div align=\"center\">\n",
    "    <img src=https://user-images.githubusercontent.com/40840292/169017581-93c8ee44-856d-4d17-970c-b6138d10f8bc.png width=\"700\"/>\n",
    "</div>\n",
    "\n",
    "- Legal Scenario - Judgment Extraction\n",
    "\n",
    "<div align=\"center\">\n",
    "    <img src=https://user-images.githubusercontent.com/40840292/169017863-442c50f1-bfd4-47d0-8d95-8b1d53cfba3c.png width=\"700\"/>\n",
    "</div>\n",
    "\n",
    "\n",
    "- Financial scenario - income certificate and prospectus extraction\n",
    "\n",
    "<div align=\"center\">\n",
    "    <img src=https://user-images.githubusercontent.com/40840292/169017982-e521ddf6-d233-41f3-974e-6f40f8f2edbc.png width=\"700\"/>\n",
    "</div>\n",
    "\n",
    "- Social scenario - accident report extraction\n",
    "\n",
    "<div align=\"center\">\n",
    "    <img src=https://user-images.githubusercontent.com/40840292/169018340-31efc1bf-f54d-43f7-b62a-8f7ce9bf0536.png width=\"700\"/>\n",
    "</div>\n",
    "\n",
    "- Tourism scenario - brochure extraction\n",
    "<div align=\"center\">\n",
    "    <img src=https://user-images.githubusercontent.com/40840292/169018113-c937eb0b-9fd7-4ecc-8615-bcdde2dac81d.png width=\"700\"/>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We conducted experiments on three major private testsets: Internet, medical and finance:\n",
    "\n",
    "<table>\n",
    "<tr><th row_span='2'><th colspan='2'>Finance<th colspan='2'>Medical<th colspan='2'>Internet\n",
    "<tr><td><th>0-shot<th>5-shot<th>0-shot<th>5-shot<th>0-shot<th>5-shot\n",
    "<tr><td>uie-base (12L768H)<td>46.43<td>70.92<td><b>71.83</b><td>85.72<td>78.33<td>81.86\n",
    "<tr><td>uie-medium (6L768H)<td>41.11<td>64.53<td>65.40<td>75.72<td>78.32<td>79.68\n",
    "<tr><td>uie-mini (6L384H)<td>37.04<td>64.65<td>60.50<td>78.36<td>72.09<td>76.38\n",
    "<tr><td>uie-micro (4L384H)<td>37.53<td>62.11<td>57.04<td>75.92<td>66.00<td>70.22\n",
    "<tr><td>uie-nano (4L312H)<td>38.94<td>66.83<td>48.29<td>76.74<td>62.86<td>72.35\n",
    "<tr><td>uie-m-large (24L1024H)<td><b>49.35</b><td><b>74.55</b><td>70.50<td><b>92.66</b><td><b>78.49</b><td><b>83.02</b>\n",
    "<tr><td>uie-m-base (12L768H)<td>38.46<td>74.31<td>63.37<td>87.32<td>76.27<td>80.13\n",
    "</table>\n",
    "\n",
    "0-shot refers to the predicted results directly from ``` paddlelp. Taskflow ```, and 5-shot refers to the predicted results from fine-tuned model with 5 annotated data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3.Usage\n",
    "\n",
    "```paddlenlp.Taskflow``` provides the ability to extract general information, evaluation opinions, etc. It can extract various types of information, including but not limited to named entity recognition (such as person name, place name, organization name, etc.), relationship (such as film director, song release time, etc.), events (such as an accident at an intersection, an earthquake at a place, etc.), evaluation dimensions, opinion words, emotional orientation, and other information. Users can use natural language to customize the extraction target, and can uniformly extract the corresponding information in the input text without training。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.1 Environment\n",
    "Install PaddleNLP"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install --upgrade paddlenlp\n",
    "! pip show paddlenlp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pprint import pprint\n",
    "from paddlenlp import Taskflow"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.2 Entity extraction\n",
    "\n",
    "Named Entity Recognition (NER) refers to identifying entities with specific meanings in text. In open domain information extraction, there is no limit to the categories extracted, and users can define them themselves.\n",
    "\n",
    "For example, the extracted target entity types are \"时间\", \"选手\" and \"赛事名称\". The example is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction\n",
    "ie = Taskflow('information_extraction', schema=schema, model='uie-base')\n",
    "ie_en = Taskflow('information_extraction', schema=schema, model='uie-base-en')\n",
    "pprint(ie(\"2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌！\")) # Better print results using pprint"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.2 Relationship Extraction\n",
    "Relation Extraction (RE) refers to identifying entities from text and extracting semantic relationships between entities to obtain triple information: subject,predicate and object.\n",
    "\n",
    "For example, \"竞赛名称\" is taken as the extraction subject, and the extraction relationship types are \"主办方\", \"承办方\" and \"已举办次数\". The example is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = {'竞赛名称': ['主办方', '承办方', '已举办次数']} # Define the schema for relation extraction\n",
    "ie.set_schema(schema) # Reset schema\n",
    "pprint(ie('2022语言与智能技术竞赛由中国中文信息学会和中国计算机学会联合主办，百度公司、中国中文信息学会评测工作委员会和中国计算机学会自然语言处理专委会承办，已连续举办4届，成为全球最热门的中文NLP赛事之一。'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.3 Event extraction\n",
    "\n",
    "Event Extraction (EE) refers to extracting predefined event trigger words and event arguments from natural language texts and combining them into corresponding event structured information.\n",
    "\n",
    "For example, the target of extraction is \"地震强度\", \"时间\", \"震中位置\" and \"震源深度\" of the \"地震\" event. The example is as follows:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = {'地震触发词': ['地震强度', '时间', '震中位置', '震源深度']} # Define the schema for event extraction\n",
    "ie.set_schema(schema) # Reset schema\n",
    "ie('中国地震台网正式测定：5月16日06时08分在云南临沧市凤庆县(北纬24.34度，东经99.98度)发生3.5级地震，震源深度10千米。')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.4 Extraction of comments\n",
    "\n",
    "Comment and opinion extraction refers to the extraction of evaluation dimensions and opinion words contained in the text.\n",
    "\n",
    "For example, the target of extraction is the evaluation dimension contained in the text and its corresponding opinion words and sentiment tendencies. The example is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = {'评价维度': ['观点词', '情感倾向[正向，负向]']} # Define the schema for opinion extraction\n",
    "ie.set_schema(schema) # Reset schema\n",
    "pprint(ie(\"店面干净，很清静，服务员服务热情，性价比很高，发现收银台有排队\")) # Better print results using pprint"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The english example is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = [{'Aspect': ['Opinion', 'Sentiment classification [negative, positive]']}]\n",
    "ie_en.set_schema(schema)\n",
    "pprint(ie_en(\"The teacher is very nice.\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.5 Sentiment classification\n",
    "Sentence level sentiment classification, that is, to judge whether the sentence's sentimentorientation is \"positive\" or \"negative\". The example is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = '情感倾向[正向，负向]' # Define the schema for sentence-level sentiment classification\n",
    "ie.set_schema(schema) # Reset schema\n",
    "ie('这个产品用起来真的很流畅，我非常喜欢')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The english example is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = 'Sentiment classification [negative, positive]'\n",
    "ie_en.set_schema(schema)\n",
    "ie_en('I am sorry but this is the worst film I have ever seen in my life.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.6 Cross scenario extraction\n",
    "\n",
    "For example, in a legal scenario, both entity extraction and relationship extraction are performed on the text. The example is as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = ['法院', {'原告': '委托代理人'}, {'被告': '委托代理人'}]\n",
    "ie.set_schema(schema)\n",
    "pprint(ie(\"北京市海淀区人民法院\\n民事判决书\\n(199x)建初字第xxx号\\n原告：张三。\\n委托代理人李四，北京市 A律师事务所律师。\\n被告：B公司，法定代表人王五，开发公司总经理。\\n委托代理人赵六，北京市 C律师事务所律师。\")) # Better print results using pprint"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.7 Model training\n",
    "For simple extraction targets, you can directly use ```paddlenlp.Taskflow```implements zero shot extraction. For private scenarios, we recommend using the customization function (mark a small amount of data for model fine-tuning) to further improve the effect. For details of model training, please refer to [UIE Training](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. Related papers and citations\n",
    "\n",
    "- **[Unified Structure Generation for Universal Information Extraction](https://arxiv.org/pdf/2203.12277.pdf)**\n",
    "- **[Quantizing deep convolutional networks for efficient inference: A whitepaper](https://arxiv.org/pdf/1806.08342.pdf)**\n",
    "- **[PACT: Parameterized Clipping Activation for Quantized Neural Networks](https://arxiv.org/abs/1805.06085)**"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
